Stochastic Gradient Bayesian Optimal Experimental Designs for Simulation-based Inference

Simulation-based inference (SBI) methods tackle complex scientific models with challenging inverse problems. However, SBI models often face a significant hurdle due to their non-differentiable nature, which hampers the use of gradient-based optimization techniques. Bayesian Optimal Experimental Design (BOED) is a powerful approach that aims to make the most efficient use of experimental resources for improved inferences. While stochastic gradient BOED methods have shown promising results in high-dimensional design problems, they have mostly neglected the integration of BOED with SBI due to the difficult non-differentiable property of many SBI simulators. In this work, we establish a crucial connection between ratio-based SBI inference algorithms and stochastic gradient-based variational inference by leveraging mutual information bounds. This connection allows us to extend BOED to SBI applications, enabling the simultaneous optimization of experimental designs and amortized inference functions. We demonstrate our approach on a simple linear model and offer implementation details for practitioners.


Introduction
Many scientific models are defined by a simulator that defines an output y determined by the inputs, or designs, to a system, ξ, and parameters that define how the model transforms the inputs to outputs, θ. Inferring a distribution of parameters given data p(θ|y, ξ) is of central importance in Bayesian statistics and can be seen as a form of solving the inverse problem for a given simulator (Lindley, 1972). In SBI, a simulator forms an implicit probability distribu-tion known as the likelihood p(y|θ, ξ) that is used with the prior of the model parameters p(θ) to estimate the posterior probability of the model parameters given the observed data, p(θ|y, ξ) (Cranmer et al., 2020). Recent SBI methods have use deep learning-based models to infer either the intractable likelihood or posterior using density estimators for both, or classifiers to estimate the likelihood-to-evidence ratio, p(θ|y,ξ) p(θ|ξ) = p(y|θ,ξ) p(y|ξ) = p(y,θ|ξ) p(θ)p(y|ξ) . However, inferring the likelihood, posterior, or ratio is a computationally expensive process that depends on observed data y o , to compute. Recent work questioned the validity of this expensive computational process used in SBI if using the wrong simulator for the true data generating process (Cannon et al.). Naïve conclusions can be made if using the wrong model of the underlying scientific phenomenon, or the model is not close enough to the real data generating process, which motivates the use of optimal experimental designs in SBI methods.
Bayesian optimal experimental design (BOED) has shown promise as a way to optimize experiments given a model, the simulator, and priors of the parameters of interest. BOED works by determining the information gain of a proposed experimental design, ξ, on the parameters of the model of interest (1) The information gain can only be evaluated after an experiment but another quantity, the Expected Information Gain (EIG), I(ξ), can be used as a proxy for the information gained in an experiment The intuition behind this process is we must ask ourselves, which experimental design and outcome would be most surprising given what we assume, or know, about the system when conducting the experiment. This can be rewritten into the form of calculating the mutual information between the observed data and unknown parameters Early BOED work focused on estimating the mutual information then using that estimate as the surrogate function in 1 arXiv:2306.15731v1 [cs.
LG] 27 Jun 2023 an outer optimizer, such as Bayesian optimization (Foster et al., 2019b;Kleinegesse & Gutmann, 2019). This double loop of optimization was inefficient and lead to development of methods to simultaneously optimize the design and mutual information in a single optimization process. However, this unified optimization depended on an unnormalized likelihood and posterior approximation (Foster et al., 2019a) or an implicit likelihood with a simulator that has a differentiable functional form (Kleinegesse & Gutmann, 2021).
We present a method to simultaneously optimize designs and the mutual information for the remaining set of models, implicit likelihoods without a differentiable simulator, which are typically used in the SBI literature. We additionally make a link to how we can use a generative model in Contrastive Precitive Coding. We show: • A differentiable objective for simultaneously optimizing the mutual information and likelihood for SBIbased models.
• A connection between Likelihood-Free based methods for BOED and contrastive ratio estimation (CRE) methods for SBI models.
• Experimental validation of the unified objective on a simple linear model.

Background
Previous work in SBI methods have focused on improving methods based on given, observed, data y o , (Papamakarios & Murray, 2016;Papamakarios et al., 2018;Durkan et al., 2020;Greenberg et al., 2019) whereas BOED has focused on determining an optimal design ξ * , based on various bounds of MI between y and θ. While these aims seem to be unrelated, we will show how they can be performed simultaneously for SBI methods that rely on potentially stochastic simulators that act as black-box functions.

Simulation-Based Inference
In many scientific disciplines, it is desirable to infer a distribution of parameters θ, of a potentially stochastic model, or simulator, given observations, y o . The closed-box simulator may depend on random numbers z, such as in stochastic differential equations, and previous experimental designs ξ, such that the simulator takes the form y = g(θ, ξ, z). When a likelihood is not available, Approximate Bayesian Computation (ABC) methods can be used, (Sisson et al., 2018) which aim to infer the likelihood of parameters of the simulator that are within an ϵ ball, B ϵ (y), of the observed data y := y o , resulting in the likelihood p(∥y −y o ∥ < ϵ|θ). However, recent SBI methods have outperform ABC methods in inference tasks (Lueckmann et al., 2021). By using a simulator to simulate the joint data distribution (θ, y) ∼ p(y|θ), drawn from a prior θ ∼ p(θ), we can obtain an amortized likelihood p ϕ (y|θ) or posterior p ϕ (θ|y) by training a neural density estimator, such as a normalizing flow, with parameters ϕ, or estimate the likelihood-to-evidence ratio exp f ϕ (θ, y) ≈ p(y|θ) p(y) , by training a classifier to distinguish parameters used to simulate an observed values, y. Different SBI methods can be used in inference for downstream applications depending on the desiderata of the inference task. For example, one might use an amortized posterior approximation if there are many different data samples to evaluate, whereas an ensemble of ratios (Hermans et al.) has been shown to perform more robustly on Simulation-Based Calibration (SBC) tests (Talts et al.) at the cost of increased computational complexity.
There are many SBI methods proposed for approximating the likelihood, posterior, or ratio. We review the relevant ones to our method here. See (Lueckmann et al., 2021) for a more thorough review and benchmark of SBI methods.
Neural Likelihood Estimation We can use data from the joint distribution to train a conditional neural densitybased likelihood function. If we take a dataset of samples {y n , θ n } 1:N obtained from a simulator as previously described, we can train a conditional density estimator p ϕ (y|θ) to model the likelihood by maximizing the total log likelihood of n log p ϕ (y n |θ n ), which is approximately equivalent to minimizing the loss (4) where the Kullback-Leibler divergence is minimized when p ϕ (y|θ) approaches p(y|θ). SBI methods would then condition this likelihood on observed data, y o , and refine the likelihood estimate by resetting the prior to become the new posterior samples via Markov Chain Monte Carlo (MCMC) sampling of the approximate likelihood p(θ) := p(θ|y o ) ∝ p ϕ (y o |θ)p(θ) and training a new neural density estimator of the likelihood (Papamakarios et al., 2018;Lueckmann et al., 2018). This is Sequential Neural Likelihood (SNL) which we forego as we focus on the preliminary step of optimizing an experimental design without y o .

Bayesian Optimal Experimental Design
Following from equation 3, (Foster et al., 2019a) proposed the prior contrastive estimation (PCE) lower bound of the MI where the expectation is over p(θ 0 )p(y|θ 0 , ξ)p(θ 1:L ) and ξ is the proposed design, θ 0 is the original parameter that generated data y, and L is the number of contrastive samples. The PCE bound is appropriate in BOED when the prior and posterior are similar enough that p(θ) is a suitable proposal distribution for p(y|ξ). This bound has low variance but is upper-bounded by log L, potentially leading to large bias but still demonstrated adequate performance on various benchmarks. Unfortunately, this bound requires a tractable likelihood function, which is not available in SBI applications.

Likelihood Free PCE
We take inspiration from previous SBI and BOED methods to allow optimization of designs with respect to closed-box simulators that are modeled using normalizing flows. We start by noting how the loss function of contrastive ratio estimation (CRE) (Durkan et al., 2020) where L is the number of contrastive samples and f ϕ is a discriminative classifier, which holds for a single batch of data and constant experimental design, i.e. when ξ is constant. We exchange an explicit likelihood in PCE with a neural density estimator to create Likelihood-Free PCE (LF-PCE). We now have a MI lower bound where the expectation is over p(θ 0 )p(y|θ 0 , ξ)p(θ 1:L ). We now can simultaneously optimizes designs and parameters of a neural density estimator. If we are to use a normalizing flow as exp f ϕ (y, θ, ξ) = p ϕ (y|θ, ξ), then the PCE lower bound of the MI holds since the distribution is normalized as normalizing flows are bounded functions (Papamakarios et al., 2019). We note that this can be an unstable objective as the data distribution of the flow will change as experimental designs change. However, the result is that it returns an amortized likelihood that can be evaluated on observed experimental data to return a posterior density or used in downstream inference algorithms, such as SNL. Finally, using a normalizing flow allows us to take gradients with respect to designs ξ, which we derive in Appendix A.

Connection to Generative MI Estimation
The mutual information bound proposed by (Foster et al., 2019a) for PCE is similar to Contrastive Predictive Coding (CPC) (Poole et al., 2019;Oord et al., 2018), but where a generative model replaces a discriminative one and the random variable X corresponds to observed data and random variable Y to the prior distribution. In our formulation the bound of the MI depends on both the amount of training tr → ∞ and number of contrastive samples L → ∞ to approach the true MI. The generative approach to CPC can be simplified as where P is a random variable representing the joint distribution we obtain from our simulators (x, y) ∼ p(x|y)p(y) and p ϕ (x) implicitly depends on the number of contrastive samples L to approximate the marginal likelihood.

Noisy Linear Model
We follow (Kleinegesse & Gutmann, 2020) and evaluate optimal designs on a classic noisy linear model where a response variable y has a linear relationship with experimental designs ξ, which is determined by values of the model parameters θ = [θ 0 , θ 1 ], which model the offset and gradient. We would like to optimize the value of D measurements to estimate the posterior of θ, and so create a design vector ξ = [ξ 1 , . . . , ξ D ] T . Each design, ξ i returns a measurement y i , which results in the data vector y = [y 1 , . . . , y D ] T . We assume non-Gaussian noise sources, otherwise evaluating the posterior and MI would be trivial. We use a Gaussian noise source N (ϵ; 0, 1) and Gamma noise source Γ(ν; 2, 2). The model is then where ϵ = [ϵ 1 , . . . , ϵ D ] T and ν = [ν 1 , . . . , ν D ] T are i.i.d. samples. We evaluate LF-PCE on this model and examine how changing the λ regularization parameter in (7) influences the resulting mutual information bound and design quality for both models.
For each design dimension, D, we randomly initialize designs ξ ∈ [−10, 10]. For LF-PCE, we chose N = 10, the number of non-contrastive samples y ∼ p(y|ξ, θ 0 ), and Figure 1. Comparison of the EIG across design dimensions, type of BOED, and λ regularization for the noisy linear model examining the moving average over N=10 samples. For the single design dimension, LF-PCE with no λ regularization outperforms in estimating a lower bound of the MI, which can translate to more informative experimental designs. In the higher-dimension design cases, LF-PCE increases its EIG with more designs, which is expected, but sees diminishing returns when expanding from 10D to 100D design evaluations. In the 100-dimensional design case, we see the benefit of using λ regularization to stabilize the training of a neural density estimator in high-dimensional input space at the cost of slightly lower EIG. M = 50 contrastive samples for all experiments. For the neural spline flow, we chose 5 bijector layers, each with 4 bins, and 4 resnet multilayer perceptrons, each with 128 dimensions, for the neural network-based conditional networks. For both the neural density estimator's parameters ϕ, and the designs ξ, we use the Adam optimizer (Kingma & Ba, 2014) with β 1 = 0.9 and β 2 = 0.99, with learning rate α = 1e −3 for the neural density estimator and α = 1e −2 for design optimization.
Examining the graph of the mutual information in Figure 1, we see that LF-PCE lower bound steadily increases for all values of lambda; however, the stability of the optimization of the generative model's parameters diverges in higher design dimensions whenever λ = 0. We see a general trend between exploration and exploitation in changing values of λ, where higher λ values lead to lower MI lower-bound estimates and potentially more homogenous designs.
Using LF-PCE we obtain an amortized neural density estimator of the likelihood that is able to perform inference on observed data evaluated at the optimal design. For example, p(θ|y o , ξ * ) ∝ p ϕ (y o |θ, ξ * )p(θ) by MCMC sampling. We evaluate the posterior densities after optimizing on the LF-PCE lower bound in Appedix B and can see the mean and interquartile range in Table 1. We note that we were able to arrive at accurate and precise posterior estimates using the neural density estimator that simultaneously optimized an optimal design ξ * , without any post-processing such as using SNL or Sampling Importance Resampling.

Discussion
We demonstrated a novel information bounds, I LF −P CE , to perform gradient-based BOED using black-box simulators present in many SBI applications and obtained lower bounds of the EIG on a toy model across a range of experimental design dimensions to showcase its scalability. Optimizing designs in SBI applications provides a valuable preconditioning step to typical sequential SBI methods such as SNL that are based on observed experimental designs. Sidestepping Bayesian optimization can also help to accelerate model testing and feedback from real-world data. Future work will examine the tradeoff between design diversity for improved entropy reduction and neural density estimator robustness, similar to the exploration and exploitation tradeoff present in Bayesian optimization.

A. Design Gradients of LF-PCE
For LF-PCE, we need unbiased gradient estimators of the information bounds. A normalizing flow can be seen as a reparameterized distribution, which allows for calculating the gradient with respect to designs ∇ ξ f −1 (u; θ, ξ). In practice, since we are evaluating the log probability of a data point, we would actually evaluate the inverse direction of a flow ∇ ξ f (y; θ, ξ) at the base distribution p u (u), which is usually a Gaussian distribution and evaluated by maximum likelihood.

B. Evaluation of Linear Model Designs and Posteriors
We evaluated the efficacy of the neural density estimator trained using the LF-PCE loss function to infer a held out true parameter value in Figure 2 by MCMC. We provide a quantitative evaluation of the posteriors in Table 1. The posteriors can be improved by computationally efficient methods such as Sampling Importance Resampling, or used in SBI algorithms that use sequential methods to refine the neural density estimator. Figure 2. Comparison of the prior density the posterior achieved by the different design dimensional normalizing flows evaluated at an optimal design p(θ|yo, ξ * ) ∝ p ϕ (yo|θ, ξ * )p(θ). The red cross denotes the true model parameters.

C. Evaluation of Posterior Predictive Distribution
As a reference, we plot the prior and posterior predictive plots for the 1-dimensional optimal design in Figure 3. An insight into the optimal experimental design problem is that the designs closer to where the prior distribution has more noise will lead to more clarification in a performed experiment, which is why the most optimal designs will be at the boundaries for the noisy linear model.